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Abstract 

We consider the problem of estimating the parameters of a Gaussian or binary distribution 
in such a way that the resulting undirected graphical model is sparse. Our approach 
is to solve a maximum likelihood problem with an added £i-norm penalty term. The 
problem as formulated is convex but the memory requirements and complexity of existing 
interior point methods are prohibitive for problems with more than tens of nodes. We 
present two new algorithms for solving problems with at least a thousand nodes in the 
Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted 
as recursive ^i-norm penalized regression. Our second algorithm, based on Nesterov's first 
order method, yields a complexity estimate with a better dependence on problem size than 
existing i nterior point methods. Using a log determinant relaxation of the log partition 
function ( Wainwright and JordanI 2009]), we show that these same algorithms can be used 
to solve an approximate sparse maximum likelihood problem for the binary case. We test 
our algorithms on synthetic data, as well as on gene expression and senate voting records 
data. 



Keywords: Model Selection, Maximum Likelihood Estimation, Convex Optimization, 
Gaussian Graphical Model, Binary Data 
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1. Introduction 



Undirected graphical models offer a way to describe and explain the relationships among a 
set of variables, a central element of multivariate data analysis. The principle of parsimony 
dictates that we should select the simplest graphical model that adequately explains the 
data. In this paper weconsider practical ways of implementing the following approach to 
finding such a model: given a set of data, we solve a maximum likelihood problem with an 
added £i-norm penalty to make the resulting graph as sparse as possible. 



Many authors have studied a variety of related ideas. In the Gaussian case, model selection 
involves finding the pattern of zeros in the inverse covariance matrix, since these zeros 
correspond to conditional independencies among the variables. Traditionally , a greedy 



forwa rd-backward search algorithm is used to determine the zero pattern [e.g., iLauritzen 



19961]. Ho wever, this i s comp utationally infeasible for data with even a moderate number of 



mp 

variables. iLi and Guil [2005l | introduce a gradient descent algorithm in which they account 



for the sparsity of the inverse covariance matri x by defining a loss function that is the 



negative of the log likelihood functi on. Recently. iHuang et al.l j2005l | considered penalized 



maximum likelihood estimation, and Dahl et al. 2006| | proposed a set of large scale methods 
for problems where a sparsity pattern for the inverse covariance is given and one must 
estimate the nonzero elements of the matrix. 



Another way to estimate the graphical model is to find the set of neighbors of each node 
in the graph by regre ssing that variable against the remaining variables. In this vein. 



Dobra and WestI 2004i | employ a stochastic algorithm to manage tens of thousands of vari- 



ables. There has also been a great dea l of interest in using £i-norm penalties in statistical 



applications. Id'Aspremont et al.l j2004i | apply an ii norm penalty to spa rse principle com po- 



nent analysis. Directly related to our problem is the use of the Lasso oflTibshiranil 1996] to 
obtain a very short list of neighbors for each node in the graph. iMeinshausen and Biihlmann 
study this approach in detail, and show that the resulting estimator is consistent. 



even for high-dimensional graphs. 



The problem formulation for Gaussian data, therefore, is simple. The difficulty lies in its 
computation. Although the problem is convex, it is non-smooth and has an unbounded 
constraint set. As we shall see, the resulting complexity for existing interior point methods 
is 0{p^), where p is the number of variables in the distribution. In addition, interior point 
methods require that at each step we compute and store a Hessian of size O(p^). The 
memory requirements and complexity are thus prohibitive for 0{p) higher than the tens. 
Specialized algorithms are needed to handle larger problems. 

The remainder of the paper is organized as follows. We begin by considering Gaussian 
data. In Section [2] we set up the problem, derive its dual, discuss properties of the solution 
and how heavily to weight the £i-norm penalty in our problem. In Section [3] we present a 
provably convergent block coordinate descent algorithm that can be interpreted as recursive 
£i-norm penalized regression. In Section ?? we present a second, alternative algorithm based 
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on Nesterov's recent work on non-smooth optimization, and give a rigorous complexity 
analysis with better dependence on problem size than interior point methods. In Section ?? 
we show that the algorithms we developed for the Gaussian case can also be used to solve 
an approximate sparse maximum likelihood problem for multivari ate binary data, using a 



log de terminant relaxation for the log partition function given by IWainwright and Jordan 



2006l |. In Section El we test our methods on synthetic as well as gene expression and senate 



voting records data. 



2. Problem Formulation 



In this section we set up the sparse maximum likelihood problem for Gaussian data, derive 
its dual, and discuss some of its properties. 



2.1 Problem setup. 

Suppose we are given n samples independently drawn from a p-variate Gaussian distribution: 
y^^\ . . . ,y^"'^ ~ N'{T,p,ij), where the covariance matrix S is to be estimated. Let S denote 
the second moment matrix about the mean: 



5:=-X;(2/('=)-/x)(|/('^)-/.)^. 

r) ^ — ^ 



n 
k=l 



Let S ^ denote our estimate of the inverse covariance matrix. Our estimator takes the 
form: 

= arg max log det X — trace(S'X) — A||X||i. (1) 

Here, ||X||i denotes the sum of the absolute values of the elements of the positive definite 
matrix X. 

The scalar parameter A controls the size of the penalty. The penalty term is a proxy for 
the number of nonzero elements in X, and is often used - albiet with vector, not matrix, 
variables - in regression techniques, such as the Lasso. 

In the case where S >- the classical maximum likelihood estimate is recovered for A = 0. 
However, when the number of samples n is small compared to the number of variables p, 
the second moment matrix may not be invertible. In such cases, for A > 0, our estimator 
performs some regularization so that our estimate S is always invertible, no matter how 
small the ratio of samples to variables is. 
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Even in cases where we have enough samples so that S y 0, the inverse S may not 
be sparse, even if there are many conditional independencies among the variables in the 
distribution. By trading off maximality of the log likelihood for sparsity, we hope to find a 
very sparse solution that still adequately explains the data. A larger value of A corresponds 
to a sparser solution that fits the data less well. A smaller A corresponds to a solution that 
fits the data well but is less sparse. The choice of A is therefore an important issue that will 
be examined in detail in Section [2.31 

2.2 The dual problem and bounds on the solution. 

We can write ([I]) as 

max min log det X + trace(X, S + U) 

l|t^l|oo<A 

where ||C/||oo denotes the maximum absolute value element of the symmetric matrix U. This 
corresponds to seeking an estimate with the maximum worst-case log likelihood, over all 
additive perturbations of the second moment matrix S. A similar robustness interpretation 
can be made for a number of estimation problems, such as support vector machines for 
classification. 

We can obtain the dual problem by exchanging the max and the min. The resulting inner 
problem in X can be solved analytically by setting the gradient of the objective to zero and 
solving for X. The result is 

min — log detfS* + U) — p 

where the primal and dual variables are related as: X = (S + U)^^ . Note that the log 
determinant function acts a log barrier, creating an implicit constraint that S + U >- 0. 

To write things neatly, let W = S + U. Then the dual of our sparse maximum likelihood 
problem is 

S :=max{logdetW^ : ||l^-5||oo < A}. (2) 

Observe that the dual problem ([1]) estimates the covariance matrix while the primal problem 
estimates its inverse. We also observe that the diagonal elements of the solution are S^fe = 
Skk + ^ for all k. 

The following theorem shows that adding the ^i-norm penalty regularlizes the solution. 

Theorem 1 For every A > 0, the optimal solution to (CP is unique, with bounded eigenval- 
ues: 

T>\\^'%>{\\Sh + Xp)-'- 



Here, ||^||2 denotes the maximum eigenvalue of a symmetric matrix A. 
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The dual problem ([2]) is smooth and convex. When p{p + l)/2 is in the low hundreds 
the problem can be solved by existing software that uses an interior point method [e.g. 



Vandenberghe et al. 



1998| |. The complexity to compute an e-suboptimal solution using 
such second-order methods, however, is log(l/e)), making them infeasible when p is 
larger than the tens. 



A related problem, solved by Dahl et al. 2006l |. is to compute a maximum likelihood es- 



timate of the covariance matrix when the sparsity structure of the inverse is known in 
advance. This is accomplished by adding constraints to ([T]) of the form: Xij = for all 
pairs in some specified set. Our constraint set is unbounded as we hope to uncover 
the sparsity structure automatically, starting with a dense second moment matrix S. 



2.3 Choice of penalty parameter. 



Consider the true, unknown graphical model for a given distribution. This graph has p 
nodes, and an edge between nodes k and j is missing if variables k and j are independent 
conditional on the rest of the variables. For a given node k, let denote its connectivity 
component: the set of all nodes that are connected to node k through some chain of edges. 
In particular, if node j Cfc, then variables j and k are independent. 



We would like to choose the penalty parameter A so that, for finite samples, the probability 
of error in estimating the graph i cal in odel is controlled. To this end, we can adapt the work 
of Meinshausen and Biihlmann 20061 ] as follows. Let denote our estimate of the connec- 
tivity component of node k. In the context of our optimization problem, this corresponds 
to the entries of row A: in S that are nonzero. 



Let a be a given level in [0, 1] . Consider the following choice for the penalty parameter in 

\[a) := (max crjCTjj — (oj 

• ln-2 + tl_,{a/2p^) 



where i„_2(a) denotes the (100 — a)% point of the Student's t-distribution for n — 2 degrees 
of freedom, and ai is the empirical variance of variable i. Then we can prove the following 
theorem: 

Theorem 2 Using \{a) the penalty parameter in ([7p, for any fixed level a, 

P{3ke{l,...,p}:Cl^Ck)<a. 



Observe that, for a fixed problem size p, as the number of samples n increases to infinity, 
the penalty parameter A(a) decreases to zero. Thus, asymptotically we recover the clas- 



5 



Banerjee, El Ghaoui, and d'Aspremont 



sical maximum likelihood estimate, S, which in turn converges in probability to the true 
covariance S. 



3. Block Coordinate Descent Algorithm 

In this section we present an algorithm for solving ([5]) that uses block coordinate descent. 

3.1 Algorithm description. 

We begin by detailing the algorithm. For any symmetric matrix A, let denote the 

matrix produced by removing row k and column j. Let Aj denote column j with the 
diagonal element Ajj removed. The plan is to optimize over one row and column of the 
variable matrix at a time, and to repeatedly sweep through all columns until we achieve 
convergence. 

Initialize: VF^ ■= S + XI 
For k>0 

1. For j = l,...,p 

(a) Let W^^~^^ denote the current iterate. Solve the quadratic program 

y := argmm{/(H^(jg'Vy : Wv " Sj\\oo < A} (4) 

(b) Update rule: W^^^ is W^^~^^ with column/row Wj replaced by y. 

2. Let := W^pI 

3. After each sweep through all columns, check the convergence condition. Convergence 
occurs when 

trace((VF(°))-i5) - p + A||(iy(°))-i||i < e. (5) 

3.2 Convergence and property of solution. 

Using Schur complements, we can prove convergence: 

Theorem 3 The block coordinate descent algorithm described above converges, acheiving 
an e-suboptimal solution to In particular, the iterates produced by the algorithm are 

strictly positive definite: each time we sweep through the columns, W^^^ >- for all j. 
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The proof of Theorem [3] sheds some interesting hght on the solution to problem ([T]) . In 
particular, we can use this method to show that the solution has the following property: 



Theorem 4 Fix any k G {1, . . . ,p} . If X > \Skj\ for all j ^ k, then column and row k of 
the solution S to (0j are zero, excluding the diagonal element. 



This means that, for a given second moment matrix S, if A is chosen such that the condition 
in Theorem m is met for some column k, then the sparse maximum likelihood method esti- 
mates variable k to be independent of all other variables in the distribution. In particular, 
Theorem [3] implies that if A > \Skj\ for all k > j, then ([1]) estimates all variables in the 
distribution to be pairwise independent. 



Using the work of Luo and Tseng 1992l |. it may be possible to show that the local conver- 



gence rate of this method is at least linear. In practice we have found that a small number 
of sweeps through all columns, independent of problem size p, is sufhcient to achieve con- 
vergence. For a fixed number of K sweeps, the cost of the method is 0{Kp'^), since each 
iteration costs 0{p^). 



3.3 Interpretation as recursive penalized regression. 

The dual of © is 

mmx^iy^^jg^^x - Sjx + A||x||i. (6) 



Strong duality obtains so that problems ([6]) and dH) are equivalent. If we let Q denote the 
'(j-/Und 6 := iQ-i5„ 



square root of and b := \Q ^Sj, then we can write ([6]) as 



min HQx — ^lln + A||x||i. (7) 

X 



The problem d?]) is a penalized least-squares problems, known as the Lasso. If were 
the j-th principal minor of the sample covariance S, then ^ would be equivalent to a 
penalized regression of v ariable j against all others. Thus, the approach is reminiscent of 



the approach explored bv lMeinshausen and Biihlmannl 2006l |. but there are two differences. 



First, we begin with some regularization and, as a consequence, each penalized regression 
problem has a unique solution. Second, and more importantly, we update the problem data 
after each regression: except at the very first update, W^js^j is never a minor of S. In this 
sense, the coordinate descent method can be interpreted as a recursive Lasso. 
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4. Nesterov's First Order Method 



In this section we apply the recent results due to lNesterovl [20051 ] to obtain a first order algo- 
rithm for solving ([1]) with lower memory requirements and a rigorous complexity estimate 
with a better dependence on problem size than those offered by interior point methods. 
Our purpose is not to obtain another algorithm, as we have found that the block coordinate 
descent is fairly efficient; rather, we seek to use Nesterov's formalism to derive a rigorous 
complexity estimate for the problem, improved over that offered by interior-point methods. 

As we will see, Nesterov's framework allows us to obtain an algorithm that has a complexity 
of 0(j)^'^/e), where e > is the desired accuracy on the objective of problem ([T]). This is 
in contrast to the complexity of interior-point methods, 0(p^ log(l/e)). Thus, Nesterov's 
method provides a much better dependence on problem size and lower memory requirements 
at the expense of a degraded dependence on accuracy. 



4.1 Idea of Nesterov's method. 

Nesterov's method applies to a class of non-smooth, convex optimization problems of the 
form 

min{/(x) : x £ Qi} (8) 

X 

where the objective function can be written as 

f{x) = f{x) + max{{Ax, u)2 ■ u e Q2}. 



Here, Qi and Q2 are bounded, closed, convex sets, /(x) is differentiable (with a Lipschitz- 
continuous gradient) and convex on Qi, and A is a linear operator. The challenge is to 
write our problem in the appropriate form and choose associated functions and parameters 
in such a w ay as to ob t ain th e best possible complexity estimate, by applying general results 



obtained by iNesterovl |2005l |. 



Observe that we can write ([T]) in the form ([8]) if we impose bounds on the eigenvalues of 
the solution, X. To this end, we let 

Qi:={x:al ^ bl} 

(9) 

Q2 := {u : ||n||oo < A} 

where the constants a, b are given such that 6 > a > 0. By Theorem [H we know that such 
bounds always exist. We also define f{x) := — logdetx + {S,x), and A := \I. 

To Qi and Q2, we associate norms and continuous, strongly convex functions, called prox- 
functions, di{x) and d2{u). For Qi we choose the Frobenius norm, and a prox-function 
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di{x) = — logdetx + log 6. For Q2, we choose the Frobenius norm again, and a prox- 
function ^2(2;) = ||ii|||,/2. 

The method applies a smoothing technique to the non-smooth problem ([8]), which replaces 
the objective of the original problem, f{x), by a penalized function involving the prox- 
function d2{u): 

fix) = fix) + max{{Ax,u) - ^dsM}- (10) 

ueQ2 

The above function turns out to be a smooth uniform approximation to / everywhere. It 
is differentiable, convex on Qi, and a has a Lipschitz-continuous gradient, with a constant 
L that can be computed as detailed below. A specific gradient scheme is then applied to 
this smooth approximation, with convergence rate 0(L/e). 

4.2 Algorithm and complexity estimate. 

To detail the algorithm and compute the complexity, we must first calculate some pa- 
rameters corresponding to our definitions and choices above. First, the strong convexity 
parameter for diix) on Qi is ai = 1/6^, in the sense that 

V'^diiX)[H,H] = traceiX-'^ HX-'^H) > b-'^\\H\\l 

for every symmetric H. Furthermore, the center of the set Qi is xq := argmiUajgQ^ diix) = 
bl, and satisfies diixo) = 0. With our choice, we have Di := maxjjgg^ diix) = plog(6/a). 

Similarly, the strong convexity parameter for d2iu) on Q2 is o"2 := 1, and we have 

D2 := maxd2iU) = /2. 
U&Q2 

With this choice, the center of the set Q2 is no := argminu^Qj d2iu) = 0. 

For a desired accuracy e, we set the smoothness parameter /i := e/2Z)2, and set xq = bl. 
The algorithm proceeds as follows: 

For A: > do 

1. Compute Vfixk) = —x~^ + S + u*ixk), where solves ([10]). 

2. Find Vk = argminy {{Vfixk),y - Xk) + ^Lie)\\y - XkWp : y G Qi}. 

3. Findzfe = argmin,{^di(X) + ^to^(W>0,x-Xi) : x e Qi}. 
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4. Update Xfc = ^zj, + ^Vk- 

In our case, the Lipschitz constant for the gradient of our smooth approximation to the 
objective function is 



L(e) := M + D2\\A\\'/i2a2€ 



where M := is the Lipschitz constant for the gradient of /, and the norm ||yl|| is 

induced by the Frobenius norm, and is equal to A. 

The algorithm is guaranteed to produce an e-suboptimal solution after a number of steps 
not exceeding 

Nie) := 



4p|L/llif^^.l + ,/M^ (11) 

= (Ky(b^)(V-5aA/^/2 + ^)/e. 
where k = 6/a is a bound on the condition number of the solution. 

Now we are ready to estimate the complexity of the algorithm. For Step 1, the gradient of 
the smooth approximation is computed in closed form by taking the inverse of x. Step 2 
essentially amounts to projecting on Qi, and requires that we solve an eigenvalue problem. 
The same is true for Step 3. In fact, each iteration costs 0{p^). The number of iterations 
necessary to achieve an objective with absolute accuracy less than e is given in (jlip by 
A^(e) = 0{p^'^/e). Thus, if the condition number k is fixed in advance, the complexity of 
the algorithm is 0{p'^'^/e). 



5. Binary Variables: Approximate Sparse Maximum Likelihood 
Estimation 



In this section, we consider the pr oblem of estimating an undirected graphical model for 
multivariate binary data. Recently, Wainwright et al. 200d | applied an £i-norm penalty to 



the logisti c regression problem to obtain a bin ary version of the high-dimensional consistency 
results of Meinshausen and Biihlmann 2006l |. We apply the log determinant relaxation of 



Wainwright and JordanI |2006l | to formulate an approximate sparse maximum likelihood 
(ASML) problem for estimating the parameters in a multivariate binary distribution. We 
show that the resulting problem is the same as the Gaussian sparse maximum likelihood 
(SML) problem, and that we can therefore apply our previously-developed algorithms to 
sparse model selection in a binary setting. 
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Consider a distribution made up of p binary random variables. Using n data samples, we 
wish to estimate the structure of the distribution. The logistic model of this distribution is 



p p^i p 

p{x] 9) = expiy^ OiXi + X] ~ ^(^)} 

i=l i=l j=i+l 

where 

p p— 1 p 

^(6*) = log ^ expiy^g^x^ + 

x£XP 1=1 i=l j=i+l 

is the log partition function. 



(12) 



(13) 



The sparse maximum likelihood problem in this case is to maximize (jl2p with an added 
^i-norm penalty on terms O^j. Specifically, in the undirected graphical model, an edge 
between nodes k and j is missing if 9kj = 0. 

A well-known difficulty is that the log partition function has too many terms in its outer 
sum to compute. Ho wever, if we use the log determ inant relaxation for the log partition 
function developed by Wainwright and Jordan 200d | , we can obtain an approximate sparse 
maximum likelihood (ASML) estimate. We shall set up the problem in the next section. 



5.1 Problem formulation. 

Let's begin with some notation. Letting d := p{p+ l)/2, define the map R : R'' S^^'^ as 
follows: 



RiO) 



/ 01 02 

•l 012 



hp 



\0p 01p 02p ■ ■ ■ / 



Suppose that our n samples are z^^ ),..., z^"''> G {— 1,+!}*'. Let Zi and Zij denote sample 
mean and second moments. The sparse maximum likelihood problem is 



^exact ■■= ^Tgmax-{R{e),R{z)) - A{e) - X\\0\\i. 



(14) 



Finally define the constant vector m = (1, |, . . . , |) S R^+^. Wainwright and Jordan 2006l | 
give an upper bound on the log partition function as the solution to the following variational 
problem: 

A{9) < max^ i log det(/?(;u) + diag(m)) + {9, jj) 

(15) 

= i • max^ log det(i?(/i) + diag(m)) + {R{9) , R{fi)) . 
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If we use the bound (jlSp in our sparse maximum likelihood problem ()14p . we won't be able 
to extract an optimizing argument 9. Our first step, therefore, will be to rewrite the bound 
in a form that will allow this. 

Lemma 5 We can rewrite the hound M5\) as 

A{e) < |log(^)-i(p+l)-^•{maxl.^m + logdet(-(i^(0) + d^a5H)). (16) 
2 2 2 2 V 



Using this version of the bound ()15p . we have the following theorem. 



Theorem 6 Using the upper bound on the log partition function given in \16\) . the approx- 
imate sparse maximum likelihood problem has the following solution: 



(17) 



where the matrix T is the solution to the following problem, related to (0); 

f := argmax{log det W : Wkk = Skk + \Wkj - Skj\ < A}. (18) 



Here, S is defined as before: 

where fi is the vector of sample means Zj. 

In particular, this means that we can reuse the algorithms developed in Sections [3] and 
?? for problems with binary variables. The relaxation ()15p is the simplest one offered by 
Wainwright and Jordan 2006l | . The relaxation can be tightened by adding linear constraints 



on the variable /i. 



5.2 Penalty parameter choice for binary variables. 



For the choice of the penalty parameter A, we can derive a formula analogous to ([3]). 
Consider the choice 

(x^(a/2pM))l 
^(")bin - J—;—-^ (19) 
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where x^(q?, 1) is the (100 — a)% point of the chi-square distribution for one degree of 
freedom. Since our variables take on values in { — 1,1}, the empirical variances are of the 
form: 

Using p9p , we have the following binary version of Theorem [2l 

Theorem 7 With U9\) chosen as the penalty parameter in the approximate sparse maxi- 
mum likelihood problem, for a fixed level a, 

P{3kG{l,...,p}:C^^Ck)<a. 



6. Numerical Results 

In this section we present the results of some numerical experiments, both on synthetic and 
real data. 



6.1 Synthetic experiments. 

Synthetic experiments require that we generate underlying sparse inverse covariance matri- 
ces. To this end, we first randomly choose a diagonal matrix with positive diagonal entries. 
A given number of nonzeros are inserted in the matrix at random locations symmetrically. 
Positive definiteness is ensured by adding a multiple of the identity to the matrix if needed. 
The multiple is chosen to be only as large as necessary for inversion with no errors. 

6.1.1 Sparsity and thresholding. 

A very simple approach to obtaining a sparse estimate of the inverse covariance matrix 
would be to apply a threshold to the inverse empirical covariance matrix, S~^. However, 
even when S is easily invertible, it can be difficult to select a threshold level. We solved a 
synthetic problem of size p = 100 where the true concentration matrix density was set to 
6 = 0.1. Drawing n = 200 samples, we plot in Figure ([T]) the sorted absolute value elements 
of S^^ on the left and on the right. 

It is clearly easier to choose a threshold level for the SML estimate. Applying a threshold to 
either or would decrease the log likelihood of the estimate by an unknown amount. 
We only observe that to preserve positive definiteness, the threshold level t must satisfy the 
bound 

t < min v'^S~^v. 

\\v\\i<l 
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Figure 1: Sorted absolute value of elements of (A) S ^ and (B) S ^. The solution S ^ to 
([1]) is un-tliresholded. 



6.1.2 Recovering structure. 



We begin with a small experiment to test the ability of the method to recover the sparse 
structure of an underlying covariance matrix. Figure[2](A) shows a sparse inverse covariance 
matrix of size p = 30. Figure [2] (B) displays a corresponding S~^, using n = 60 samples. 
Figure [2] (C) displays the solution to ([1]) for A = 0.1. The value of the penalty parameter 
here is chosen arbitrarily, and the solution is not thresholded. Nevertheless, we can still 
pick out features that were present in the true underlying inverse covariance matrix. 




Figure 2: Recovering the sparsity pattern. We plot (A) the original inverse covariance 
matrix S"^, (B) the noisy sample inverse S~^, and (C) the solution to problem 
dH) for A = 0.1. 

Using the same underlying inverse covariance matrix, we repeat the experiment using 
smaller sample sizes. We solve ([T]) for n = 30 and n = 20 using the same arbitrarily 
chosen penalty parameter value A = 0.1, and display the solutions in Figure ([3]). As ex- 
pected, our ability to pick out features of the true inverse covariance matrix diminishes with 
the number of samples. This is an added reason to choose a larger value of A when we have 
fewer samples, as in ([3]). 
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ABC 




10 20 30 10 20 30 10 20 30 



Figure 3: Recovering the sparsity pattern for small sample size. We plot (A) the original 
inverse covariance matrix (B) the solution to problem ([1]) for n = 30 and 

(C) the solution for n = 20. A penalty parameter of A = 0.1 is used for (B) and 
(C). 




Figure 4: Path following: elements of solution to ([T]) as A increases. Red lines correspond to 
elements that are zero in the true inverse covariance matrix; blue lines correspond 
to true nonzeros. Vertical lines mark a range of A values using which we recover 
the sparsity pattern exactly. 



6.1.3 Path following experiments. 

Figure @ shows two path following examples. We solve two randomly generated problems 
of size p = 5 and n = 100 samples. The red lines correspond to elements of the solution 
that are zero in the true underlying inverse covariance matrix. The blue lines correspond 
to true nonzeros. The vertical lines mark ranges of A for which we recover the correct 
sparsity pattern exactly. Note that, by Theorem IU for A values greater than those shown, 
the solution will be diagonal. 
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Figure 5: Recovering sparsity pattern in a matrix with added uniform noise of size a = 0.1. 

We plot the average percentage or misclassified entries as a function of log{X/a). 



On a related note, we observe that ([T]) also works well in recovering the sparsity pattern of a 
matrix masked by noise. The following experiment illustrates this observation. We generate 
a sparse inverse covariance matrix of size p = 50 as described above. Then, instead of using 
an empirical covariance S as input to ([1]), we use S = + V)~^, where 1^ is a randomly 
generated uniform noise of size a = 0.1. We then solve ([1]) for various values of the penalty 
parameter A. 

In figure O for a each of value of A shown, we randomly selected 10 sample covariance ma- 
trices S of size p = 50 and computed the number of misclassified zeros and nonzero elements 
in the solution to ([T]). We plot the average percentage of errors (number of misclassified 
zeros plus misclassified nonzeros divided by p^), as well as error bars corresponding to one 
standard deviation. As shown, the error rate is nearly zero on average when the penalty is 
set to equal the noise level a. 

6.1.4 CPU TIMES VERSUS PROBLEM SIZE. 

For a sense of the practical performance of the Nesterov method and the block coordinate 
descent method, we randomly selected 10 sample covariance matrices S for problem sizes p 
ranging from 400 to 1000. In each case, the number of samples n was chosen to be about a 
third of p. In figure [6] we plot the average CPU time to achieve a duality gap of e = 1. CPU 
times were computed using an AMD Athlon 64 2.20Ghz processor with 1.96GB of RAM. 
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Figure 6: Average CPU times vs. problem size using block coordinate descent. We plot the 
average CPU time (in seconds) to reach a gap of e = 0.1 versus problem size p. 



As shown, we are typically able to solve a problem of size p = 1000 in about two and half 
hours. 



6.1.5 Performance as a binary classifier. 

In this section we numerically examine the ability of the sparse maximum likelihood (SML) 
method to correctly classify elements of the inver se covariance matrix as zero or nonze ro. For 
comparision, we will use the Lasso estimate of Meinshausen and Biihlmann 2006l |. which 



has been shown to perform extremely well. The Lasso regresses each variable against all 
others one at a time. Upon obtaining a solution 0^*^^ for each variable k, one can estimate 

(k) 

sparsity in one of two ways: either by declaring an element Sjj nonzero if both ei^' ^ 
and Oj'^ 7^ (Lasso-AND) or, less conservatively, if either of those quantities is nonzero 
(Lasso-OR). 



As noted previously, iMeinshausen and Biihlmannl 2006l | have also derived a formula for 



choosing their penalty parameter. Both the SML and Lasso penalty parameter formulas 
depend on a chosen level a, which is a bound on the same error probability for each method. 
For these experiments, we set a = 0.05. 

In the following experiments, we fixed the problem size p at 30 and generated sparse un- 
derlying inverse covariance matrices as described above. We varied the number of samples 
n from 10 to 310. For each value of n shown, we ran 30 trials in which we estimated the 
sparsity pattern of the inverse covariance matrix using the SML, Lasso-OR, and Lasso-AND 
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Figure 7: Classifying zeros and nonzeros for a true density of S = 0.05. We plot the positive 
predictive value, the power, and the estimated density using SML, Lasso-OR and 
Lasso- AND. 



methods. We then recorded the average number of nonzeros estimated by each method, 
and the average number of entries correctly identified as nonzero (true positives). 

We show two sets of plots. Figure ([7|) corresponds to experiments where the true density 
was set to be low, 5 = 0.05. We plot the power (proportion of correctly identified nonzeros), 
positive predictive value (proportion of estimated nonzeros that are correct), and the density 
estimated by each method. Figure ([8]) corresponds to experiments where the true density 
was set to be high, 5 = 0.40, and we plot the same three quantities. 
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Figure 8: Classifying zeros and nonzeros for a true density of 6 = 0.40. We plot the positive 
predictive value, the power, and the estimated density using SML, Lasso-OR and 
Lasso- AND. 



Meinshausen and Biihlmann 20061 ] report that, asymptotically, Lasso-AND and Lasso-OR 



yield the same estimate of the sparsity pattern of the inverse covariance matrix. At a finite 
number of samples, the SML method seems to fall in in between the two methods in terms 
of power, positive predictive value, and the density of the estimate. It typically offers, on 
average, the lowest total number of errors, tied with either Lasso-AND or Lasso-OR. Among 
the two Lasso methods, it would seem that if the true density is very low, it is slightly better 
to use the more conservative Lasso-AND. If the density is higher, it may be better to use 
Lasso-OR. When the true density is unknown, we can achieve an accuracy comparable to 
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the better choice among the Lasso methods by computing the SML estimate, 
shows one example of sparsity pattern recovery when the true density is low. 



Figure 



t • » 



Figure 9: Comparing sparsity pattern recovery to the Lasso. (A) true covariance (B) Lasso- 
OR (C) Lasso-AND (D) SML. 



The Lasso and SML methods have a comparable computational complexity. However, unlike 
the Lasso, the SML method is not parallelizable. Parallelization would render the Lasso a 
more computationally attractive choice, since each variable can regressed against all other 
separately, at an individual cost of 0{p^). In exchange, SML can offer a more accurate 
estimate of the sparsity pattern, as well as a well-conditioned estimate of the covariance 
matrix itself. 



6.2 Gene expression and U.S. Senate voting records data. 



We tested our algorithms on three sets of data: two gene expression data sets, as well as 
US Senate voting records. In this section we briefly explore the resulting graphical models. 
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6.2.1 ROSETTA InPHARMATICS COMPENDIUM. 



We applied our algo rithms to the Roset ta Inpharmatics Compendium of gene expression 
profiles described by Hughes et al. 200Cll ]. The 300 experiment compendium contains n = 
253 samples with p = 6136 variables. With a view towards obtaining a very sparse graph, 
we replaced a/2p^ in ([3]) by a, and set a = 0.05. The resulting penalty parameter is 
A = 0.0313. 



This is a large penalty for this data set, and by applying Theorem[4]we find that all but 270 
of the variables are estimated to be independent from all the rest, clearly a very conservative 
estimate. Figure ([TO]) displays the resulting graph. 




Figure 10: Application to Hughes compendium. The above graph results from solving ([T]) 
for this data set with a penalty parameter of A = 0.0313. 



Figure (fTTIl closes in on a region of Figure (fTOl) , a cluster of genes tha t is unconne c ted t o 
the remaining genes in this estimate. According to Gene Ontology [see IConsortiuml . l2000f | . 
these genes are associated with iron homeostasis. The probability that a gene has been false 
included in this cluster is at most 0.05. 



As a second example, in Figure (I12p . we show a subgraph of genes associated with cellular 
membrane fusion. All three graphs were rendered using Cytoscape. 
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Figure 11: Application to Hughes dataset (closeup of Figure (fTOj) . These genes are associ- 
ated with iron homeostasis. 




Figure 12: Application to Hughes dataset (subgraph of Figure pO|) . These genes are asso- 
ciated with cellular membrane fusion. 



22 



Model Selection Through Sparse Max Likelihood Estimation 



Table 1: Predictor genes for LDL receptor. 



Accession 



Gene 



BF553500 Cbp/p300-interacting transactivator 

BF387347 EST 

BF405996 CALCIUM CHANNEL, VOLTAGE DEPENDENT 

NM_017158 CYTOCHROME P450, 2c39 

K03249 ENOYL-CoA, hydratase/3-hydroxyacyl Co A dehydrog. 

BE100965 EST 

AI411979 Carnitine O-acetyltransferase 

AI410548 3-hydroxyisobutyryl-Co A hydrolase 

NM_017288 SODIUM channel, voltage-gated 

Y00102 ESTROGEN RECEPTOR 1 

NM 013200 CARNITINE PALMITOYLTRANSFERASE 1b 



6.2.2 ICONIX MICRO ARRAY DATA. 



Next we analyzed a subset of a 10, 000 gene microarray dataset from 160 drug treated rat 
livers iNatsoulis et al.l 20051 ]. In this study, rats were treated with a variety of fibrate, statin, 
or estrogen receptor agonist compounds. Taking the 500 genes with the highest variance, we 
once again replaced a/2p^ in ([3]) by a, and set a = 0.05. The resulting penalty parameter 
is A = 0.0853. 



By applying Theorem |4] we find that all but 339 of the variables are estimated to be inde- 
pendent from the rest. This estimate is less conservative than that obtained in the Hughes 
case since the ratio of samples to variables is 160 to 500 instead of 253 to 6136. 

The first order neighbors of any node in a Gaussian graphical model form the set of pre- 
dictors for that variable. In the estimated obtained by solving ([T]), we found that LDL 
receptor had one of the largest number of first-order neighbors in the Gaussian graphical 
model. The LDL receptor is believed to be one of the key mediators of the effect of both 
statins and estrogenic compounds on LDL cholesterol. Table [1] lists some of the first order 
neighbors of LDL receptor. 

It is perhaps not surprising that several of these genes are directly involved in either lipid 
or steroid metabolism (K03249, AI411979, AI410548, NM_013200, Y00102). Other genes 
such as Cbp/p300 are known to be global transcriptional regulators. Finally, some are un- 
annotated ESTs. Their connection to the LDL receptor in this analysis may provide clues 
to their function. 
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6.2.3 Senate voting records data. 

We conclude our numerical experiments by testing our approximate sparse maximum likeli- 
hood estimation method on binary data. The data set consists of US senate voting records 
data from the 109th congress (2004 - 2006). There are one hundred variables, correspond- 
ing to 100 senators. Each of the 542 samples is bill that was put to a vote. The votes are 
recorded as -1 for no and 1 for yes. 

There are many missing values in this dataset, corresponding to missed votes. Since our 
analysis depends on data values taken solely from { — 1, 1}, it was necessary to impute values 
to these. For this experiment, we replaced all missing votes with noes (-1). We chose the 
penalty parameter A(a) according to p^ . using a significance level of a = 0.05. Figure 
shows the resulting graphical model, rendered using Cytoscape. Red nodes correspond 
to Republican senators, and blue nodes correspond to Democratic senators. 

We can make some tentative observations by browsing the network of senators. As neighbors 
most Democrats have only other Democrats and Republicans have only other Republicans. 
Senator Chafee (R, RI) has only democrats as his neighbors, an observation that supports 
media statements made by and about Chafee during those years. Senator Allen (R, VA) 
unites two otherwise separate groups of Republicans and also provides a connection to 
the large cluster of Democrats through Ben Nelson (D, NE). Senator Lieberman (D, CT) 
is connected to other Democrats only through Kerry (D, MA), his running mate in the 
2004 presidential election. These observations also match media statements made by both 
pundits and politicians. Thus, although we obtained this graphical model via a relaxation 
of the log partition function, the resulting picture is largely supported by conventional 
wisdom. Figure (|14p shows a subgraph consisting of neighbors of degree three or lower of 
Senator Allen. 
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Figure 13: US Senate, 109th Congress (2004-2006). The graph displays the solution to 
(I14p obtained using the log determ inant relaxation to the log partition function 
of IWainwright and Jordan! 2006[ |. Democratic senators are colored blue and 
Republican senators are colored red. 
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Figure 14: US Senate, 109th Congress. Neighbors of Senator Ahen (degree three or lower). 
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Appendix A. Proof of solution properties and block coordinate descent 
convergence 

Li this section, we give short proofs of the two theorems on properties of the solution to 
([I]), as well as the convergence of the block coordinate descent method. 

Proof of Theorem [1} 

Since S satisfies II = 5 + C/, where ||C/||oo < A, we have: 

||S||2 = \\S + U\\2 

< \\S\\2 + \\U\\2 < ||5||2 + ||C/||oo < \\S\\2 + Xp 

which yields the lower bound on ||E~^||2. Likewise, we can show that ||S~^||2 is bounded 
above. At the optimum, the primal dual gap is zero: 

-logdetS-i + trace(S'S-i) + AUS-^i - logdetS - p 
= trace(5S-i) + A||S-i||i - p = 
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We therefore have 

"^^^ '|2 ^ 11^ 



= p/X - tTace{S±-^)/X < p/X 
where the last inequahty follows from trace(S'S~^) > 0, since 5 ^ and y 0. U 

Next we prove the convergence of block coordinate descent: 
Proof of Theorem [3} 

To see that optimizing over one row and column of in ([2]) yields the quadratic program 
([U, let all but the last row and column of W be fixed. Since we know the diagonal entries 
of the solution, we can fix the remaining diagonal entry as well: 

V < WppJ 

Then, using Schur complements, we have that 

det W = det W\p\p ■ (Wpp - Wp {W\p\j,)''^Wp) 

which gives rise to Q. 



By general results on block coordinate descent algorithms [e.g., Bertsekas . 1998 1] . the algo- 



rithms converges if and only if ^ has a unique solution at each iteration. Thus it suffices 
to show that, at every sweep, W^^^ >- for all columns j. Prior to the first sweep, the initial 
value of the variable is positive definite: W^^^ >- since -.= 3 + XI, and we have S 
and A > by assumption. 

Now suppose that W^^^ >- 0. This implies that the following Schur complement is positive: 

By the update rule we have that the corresponding Schur complement for W^^^^^ is even 
greater: 

y^n - wJ{w{^:!;Pr'w, > w,, - wjiw^{.)-'w, > 

so that ^0. ■ 



Finally, we apply Theorem [3] to prove the second property of the solution. 
Proof of Theorem [4} 
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Suppose that column j of the second moment matrix satisfies \Sij\ < A for all i ^ j- This 
means that the zero vector is in the constraint set of ^ for that column. Each time we 
return to column j, the objective function will be different, but always of the form y'^Ay 
for A y 0. Since the constraint set will not change, the solution for column j will always be 
zero. By Theorem [3l the block coordinate descent algorithm converges to a solution, and 
so therefore the solution must have = 0. ■ 



Appendix B. Proof of error bounds. 



Next we shall show that the penalty parameter choice given in ([31) yields the error probability 
bound of Theorem [2l The proof is nearly identical to that of Meinshausen and Biihlmannl . 



2006l .Theorem 3]. The differences stem from a different objective function, and the fact 



that our variable is a matrix of size p rather than a vector of size p. Our proof is only an 
adaptation of their proof to our problem. 

B.l Preliminaries 



Before we begin, consider problem ([T]), for a matrix S of any size: 

X = argmin — log det X + trace(S'X) + A||X||i 

where we have dropped the constraint X y since it is implicit, due to the log determinant 
function. Since the problem is unconstrained, the solution X must correspond to setting 
the subgradient of the objective to zero: 

Sij - X-^ = -A for Xij > 

Sij - Xr^ = A for Xij < (20) 

\Sij-X-.^\<X forXij = 
Recall that by Theorem [H the solution is unique for A positive. 



B.2 Proof of error bound for Gaussian data. 



Now we are ready to prove Theorem [2j 
Proof of Theorem [2} 

Sort columns of the covariance matrix so that variables in the same connectivity component 
are grouped together. The correct zero pattern for the covariance matrix is then block 
diagonal. Define 

^.correct _ ^Ik diag(Ci, . . . , Q) (21) 
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The inverse (Y!,^oicicect^-i jj^^g^ ^jgQ ]-,g block diagonal, with possible additional zeros inside 
the blocks. If we constrain the solution to ([T]) to have this structure, then by the form of 
the objective, we can optimize over each block separately. For each block, the solution is 
characterized by ([20]). 



Now, suppose that 

A> max Effect I (22) 

ieN,j€N\Ci ■' 

Then, by the subgradient characterization of the solution noted above, and the fact that 
the solution is unique for A > 0, it must be the case that t = T,^OTTect_ gy ^.j^g definition 
of scorrect^ ^j^jg i^phes that, for E, we have Ck = Ck for all k e N. 

Taking the contrapositive of this statement, we can write: 
Pi3k eN:Ck^Ck) 

< P(max,,^,,,^\c. 1^.. - > A) 

(23) 

< pHn) . max,e^,,e^\c. P{\S^J - SCorrect| > 
= p^{n) ■ maXi(,N,jeN\c, P{\Sij\ > A) 

The equality at the end follows since, by definition, = for j £ N\Ci. It remains 

to bound P{\S^j\ > A). 

The statement \Skj\ > A can be written as: 

\Rkj\{l - Rlj)--^ > X{skkSjj - A2)-^ 
where Rkj is the correlation between variables k and j, since 

\Rkj\il - Rl,)'-^ = \Skj\iSkkSjj - Sl^)--2 

Furthermore, the condition j G N\Ck is equivalent to saying that variables k and j are 
independent: E/jj = 0. Conditional on this, the statistic 



Rkj{l-Rij)-Hn-2)-2 



has a Student's t-distribution for n — 2 degrees of freedom. Therefore, for all j £ N\Ck, 
P{\Skj\ ^ A|S'fcfe = Skk,Sjj = Sjj) 

= 2P(Tn-2 > K^kkSjj - A2)"5(n - 2)^\Skk = Skk,Sjj = sjj) (24) 
< 2Fn-2{X{^la] - A2)-I(n - 2)^) 
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where a'j. is the sample variance of variable k, and Fn-2 = 1 — is the CDF of the 

Student's t-distribution with n — 2 degree of freedom. This implies that, for all j £ N\Ck, 

P{\Skj\ > A) < 2Fn-2iXiala] - X^Hn - 2)^) 

since P{A) = J P{A\B)P{B)dB < K J P{B)dB = K. Putting the inequahties together, 
we have that: 

P{3k : % Ck) 

< p2 . uia^kj€N\Ck 2^n-2(A(<T|a| - A2)-5(n - 2)5) 

= 2p2F„_2(A((n - 2)/((maxi>j ak&jf - A^))!) 
For any fixed a, our required condition on A is therefore 

F„_2(A((n - 2)/((maxafc(jj)2 - A^))!) = a/2p2 

i>j 

which is satisfied by choosing A according to ([3]). ■ 
B.3 Proof of bound for binary data. 

We can reuse much of the previous proof to derive a corresponding formula for the binary 
case. 

Proof of Theorem [3 

The proof of Theorem [7| is identical to the proof of Theorem [2l except that we have a 
different null distribution for \Skj\- The null distribution of 

is chi-squared with one degree of freedom. Analogous to (j24p . we have: 

-Pd'S'fejr'l ^ X\Skk = SkkiSjj = Sjj) 

= 2P{nRl - > n\^SkkSjj\Skk = Skk,Sjj = Sjj) 
< 2G{nX'alaf) 

where a^. is the sample variance of variable k, and G = 1 — G is the CDF of the chi-squared 
distribution with one degree of freedom. This imphes that, for all j G N\Ck, 

P{\Skj\ > A) < 2G((A<Tfc<T,V^)2) 
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Putting the inequalities together, we have that: 

Pi3k : <l Ck) 

< ■ maxfcjgAr\c'fc 2G{{\crkaj^/nf) 
= 2p2G'((mini>j dkCfjfnX^) 
so that, for any fixed a, we can achieve our desired bound by choosing A(a) according to 

(nni). ■ 



Appendix C. Proof of connection between Gaussian SML and binary 
ASML 



We end with a proof of Theorem [6l which connects the exact Gaussian sparse maximum 
hkehhood problem with the approxim a,te sparse maximum likelihood problem obtained by 
using the log determinant relaxation of Wainwright and Jordan 2006f | . First we must prove 



Lemma [5l 

Proof of Lemma [5} 

The conjugate function for the convex normalization A{9) is defined as 

yl*(/i):=sup{(/x,e)-yl(0)} (25) 
e 

Wainwright and Jordan derive a lower bound on this conjugate function using an entropy 
bound: 

A*(/.)>i3*(/i) (26) 

Since our original variables are spin variables x {—1, +1}, the bound given in the paper is 

B*{^i) := -ilogdet(i?(^) +diag(m)) - |log(y) (27) 

where m := (1, |, . . . , |). 

The dual of this lower bound is B{6): 

B*{n) :=maxe(0,/i) -5(0) 

(28) 

< maxg{e,fi) - A{d) =: A*{fi) 
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This means that, for all fi,9, 



{e,fi)-B{e)<A*{fi) 



(29) 



or 



(30) 



so that in particular 



B{e) > max{9,fi) -A*{fi) =: A{e) 



(31) 



Using the definition of B{9) and its dual B*{^), we can write 



B{e) ■.= max^{e,fi) -B*{^i) 

= 2 log(f ) + max^ ^{R{e), R{fi)) + ^ logdet{R{^l) + diag(m)) 
= |log(f ) + i •max{(i?(e),X -diag(m)) +logdet(X) : X ^ 0,diag(X) = m} 
= \ log(f^) + \ • {maxx^^o mini,(i?(6'), X - diag(m)) + log det(X) + zy^(diag(X) - m)} 
= |log(f ) + \ ■ {maxx^omin^(i?(^) +diag(i/),X) +logdet(X) - v^m} 
= I log(f ) + i • {min^ -u'^m + maxx^o(-R(^) + diag(z^), X) + log det(X)} 
= I log(f ) + i . {min^ -i^^m - log det(-(i?(0) + diag(z.))) - (p + 1)} 
= £ log(f ) - i(p + 1) + i • {min^ -i^^m - logdet(-(i?(0) + diag(z.)))} 
= I log(f ) - i(p + 1) - i • {max^ i^^m + logdet(-(i?(e) + diag(i/A)} 



Now we use lemma [5] to prove the main result of section 15.11 Having expressed the upper 
bound on the log partition function as a constant minus a maximization problem will help 
when we formulate the sparse approximate maximum likelihood problem. 

Proof of Theorem [6} 



(32) 
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The approximate sparse maximum likelihood problem is obtained by replacing the log par- 
tition function A{9) with its upper bound B{9), as derived in lemma [5} 

n ■ {maxe ^{R{9),R(z)) - B{9) - X\\9\\i} 

= n ■ {maxe \{R{9), R{z)) - X\\9\\i + i(p + 1) - f log(f ) 

+\ ■ {max^ u'^m + log det(-(ii(6i) + diag(z^)))}} (33) 

= f (p + 1) - f log(f ) + I ■ max,,,{i.^m + {R{9), R{z)) 

+ logdet(-(i?(0) +diag(i/))) -2A||0||i} 

We can collect the variables 9 and v into an unconstrained symmetric matrix variable 
Y := -(i?(0) + diag(z.)). 



Observe that 



and that 



{R{9),R{-z)) = {-Y -dis.g{u),R{-z)) 

= -{Y,R{z)) - {dia.g{u),R{z)) = -{Y,R{z)) 

v'^m = (diag(z/), diag(z^)) = (—1" — diag(m)) 
= — (y, diag(m)) — diag(m)) = — (y, diag(m)) 



(34) 



(35) 



The approximate sparse maximum likelihood problem can then be written in terms of Y: 
§(p + 1) - f log(f ) + f • maxe,,{z.^m + {R{9),R{z)) 

+ logdet(-(i?(e) + diag(z.))) - 2A||0||i} 

(36) 

= §(p + 1) - f log(f ) + § • max{log det Y - {Y, R{z) + diag(m)) 
If we let M := R{z) + diag(m), then: 



where jj. is the sample mean and 



z = lyzW(zW)^ 

n ^ 

k=l 
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Due to the added |l term, we have that M >- for any data set. 
The problem can now be written as: 

p p+i 

Y := argmax{log det F - (F, M) - 2A ^ ^ \Yij\:Y y 0} (37) 

1=2 j=i+l 



Since we are only penalizing certain elements of the variable Y, the solution X of the dual 
problem to (j37p will be of the form: 



where ^ 

X := argmax{logdety : Vkk = Z^k + -, \Vkj - Zkj\ < A}. 

We can write an equivalent problem for estimating the covariance matrix. Define a new 
variable: 

T = V -fLfl^ 

Using this variable, and the fact that the second moment matrix about the mean, defined 
as before, can be written 

n 

-.T 



n 
k=l 



we obtain the formulation (jlSp . Using Schur complements, we see that our primal variable 
is of the form: 

^* * 



Y 



From our definition of the variable Y, we see that the parameters we are estimating, 9kj, 
are the negatives of the off-diagonal elements of F^^, which gives us (|17p . ■ 
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